Improved Sampling Techniques for Learning an Imbalanced Data Set
Authors
Abstract
This paper presents the performance of a classifier built using the StackingC algorithm on nine different data sets. Each data set is generated by applying a sampling technique to the original imbalanced data set. Five new sampling techniques are proposed in this paper (i.e., SMOTERandRep, Lax Random Oversampling (LRO), Lax Random Undersampling (LRU), Combined-Lax Random Oversampling Undersampling (C-LROU), and Combined-Lax Random Undersampling Oversampling (C-LRUO)), based on the three sampling techniques (i.e., Random Undersampling (RU), Random Oversampling (RO), and the Synthetic Minority Oversampling Technique (SMOTE)) usually used as solutions in imbalanced learning. The metrics used to evaluate the classifier's performance were F-measure and G-mean. F-measure determines the performance of the classifier for every class, while G-mean measures the overall performance of the classifier. The results using F-measure showed that for the data without a sampling technique, the classifier's performance is good only for the majority class. They also showed that among the eight sampling techniques, RU and LRU have the worst performance, while the other techniques (i.e., RO, C-LRUO, and C-LROU) performed well only on some classes. The best-performing techniques across all data sets were SMOTE, SMOTERandRep, and LRO, having their lowest F-measure values between 0.5 and 0.65. The results using G-mean showed that the oversampling technique that attained the highest G-mean value is LRO (0.86), followed by C-LROU (0.85), then SMOTE (0.84), and finally SMOTERandRep (0.83). Combining the results of the two metrics (F-measure and G-mean), only three sampling techniques are considered good performing (i.e., LRO, SMOTE, and SMOTERandRep).
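The five proposed techniques are not specified in enough detail in the abstract to reproduce, but the evaluation protocol it describes (resample, train a stacked classifier, score with per-class F-measure and overall G-mean) can be sketched for the three baseline techniques. The sketch below is only an illustration under assumptions: it uses scikit-learn and imbalanced-learn (neither is named in the paper), a synthetic imbalanced data set in place of the paper's data, and scikit-learn's StackingClassifier as a stand-in for the StackingC algorithm.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced data set standing in for the paper's original data.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.85, 0.10, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# Baseline sampling techniques named in the abstract; "none" keeps the
# original imbalanced training data for comparison.
samplers = {
    "none": None,
    "RU": RandomUnderSampler(random_state=0),   # Random Undersampling
    "RO": RandomOverSampler(random_state=0),    # Random Oversampling
    "SMOTE": SMOTE(random_state=0),             # Synthetic Minority Oversampling
}

for name, sampler in samplers.items():
    if sampler is None:
        X_res, y_res = X_train, y_train
    else:
        X_res, y_res = sampler.fit_resample(X_train, y_train)

    # Stacked ensemble as a stand-in for the StackingC algorithm.
    clf = StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("nb", GaussianNB())],
        final_estimator=LogisticRegression(max_iter=1000))
    clf.fit(X_res, y_res)
    y_pred = clf.predict(X_test)

    per_class_f = f1_score(y_test, y_pred, average=None)  # F-measure per class
    g_mean = geometric_mean_score(y_test, y_pred)         # overall G-mean
    print(f"{name:6s} F-measure per class: {per_class_f}, G-mean: {g_mean:.3f}")
```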
Similar Articles
Data Mining for Imbalanced Datasets: An Overview
A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years have brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs...
Learning Classifiers from Imbalanced, Only Positive and Unlabeled Data Sets
In this report, I presented my results for the tasks of the 2008 UC San Diego Data Mining Contest. This contest consists of two classification tasks based on data from a scientific experiment. The first task is a binary classification task which is to maximize the classification accuracy on an evenly distributed test data set, given a fully labeled imbalanced training data set. The second task is also ...
Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining
This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...
An Effective Method for Imbalanced Time Series Classification: Hybrid Sampling
Most traditional supervised classification learning algorithms are ineffective for highly imbalanced time series classification, which has received considerably less attention than imbalanced data problems in data mining and machine learning research. Bagging is one of the most effective ensemble learning methods, yet it has drawbacks on highly imbalanced data. Sampling methods are considered t...
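As a rough illustration of the general idea of pairing bagging with resampling (not the specific hybrid sampling method proposed in that paper), imbalanced-learn's BalancedBaggingClassifier rebalances the bootstrap sample drawn for each base estimator; the data set and parameters below are assumptions chosen for demonstration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from imblearn.ensemble import BalancedBaggingClassifier

# Binary data set with a 95:5 class ratio, for demonstration only.
X, y = make_classification(n_samples=1500, weights=[0.95, 0.05], random_state=1)

# Each of the 25 bagged trees is trained on a bootstrap sample that is
# rebalanced by random under-sampling before fitting.
bagger = BalancedBaggingClassifier(n_estimators=25, random_state=1)
print(cross_val_score(bagger, X, y, cv=5, scoring="balanced_accuracy").mean())
```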
A Selective Sampling Method for Imbalanced Data Learning on Support Vector Machines
The class imbalance problem in classification has been recognized as a significant research problem in recent years and a number of methods have been introduced to improve classification results. Rebalancing class distributions (such as over-sampling or under-sampling of learning datasets) has been popular due to its ease of implementation and relatively good performance. For the Support Vector...
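The rebalancing approach mentioned above (over- or under-sampling the learning data before training an SVM) can be sketched as follows; this is a generic resample-then-train pipeline under assumed libraries and data, not the selective sampling method that paper proposes. imbalanced-learn's Pipeline applies the sampler only during fitting, so cross-validation folds are resampled correctly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# Binary data set with a 90:10 class ratio, for demonstration only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=2)

# Over-sample the minority class, then fit an RBF-kernel SVM; the sampler
# runs only on the training folds inside cross-validation.
svm_pipe = Pipeline([("oversample", RandomOverSampler(random_state=2)),
                     ("svm", SVC(kernel="rbf", gamma="scale"))])
print(cross_val_score(svm_pipe, X, y, cv=5, scoring="f1").mean())
```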
Journal: CoRR
Volume: abs/1601.04756
Pages: -
Publication date: 2016